library(tidyverse)
library(tidymodels)
library(ggplot2)
library(corrplot)
library(ggthemes)
library(kableExtra)
tidymodels_prefer()
set.seed(3435)
diamonds_split <- initial_split(diamonds, prop = 0.80,
strata = price)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)
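As a quick sanity check (not part of the original lab code), we can confirm that the split put roughly 80% of the 53,940 diamonds into the training set:

```r
# Sanity check on the split: diamonds has 53,940 rows, so we expect
# roughly 43,000 training and 11,000 testing observations.
nrow(diamonds_train)
nrow(diamonds_test)

# The training proportion should be close to 0.80 (stratified splits
# round within strata, so it won't be exactly 0.80):
nrow(diamonds_train) / nrow(diamonds)
```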
You’ll notice that the textbooks use the lm() function
to fit a linear regression. The tidymodels framework,
however, has its own structure and flow, designed to work
seamlessly with many different machine learning models and packages.
To fit any model with tidymodels, the first step is to
create a recipe. The structure of this recipe is similar to that of
lm(); the outcome is listed first, then the features are
added:
simple_diamonds_recipe <-
recipe(price ~ ., data = diamonds_train)
Note that . is a placeholder for “all other variables.”
If we call the recipe object now, we can see some information:
simple_diamonds_recipe
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 9
More specifically, we see that there are 9 predictors.
We should dummy-code all categorical predictors. We can do that
easily with step functions:
diamonds_recipe <- recipe(price ~ ., data = diamonds_train) %>%
step_dummy(all_nominal_predictors())
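Additional preprocessing steps chain on in exactly the same way. As a hypothetical illustration (this step is not part of the recipe we use in this lab), we could also center and scale the numeric predictors:

```r
# Hypothetical extension of the recipe -- not used in this lab:
diamonds_recipe_scaled <- recipe(price ~ ., data = diamonds_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  # Center and scale all numeric predictors:
  step_normalize(all_numeric_predictors())
```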
Note that we haven’t specified what type of model we’ll be fitting yet. The other beauty of the recipe is that it can then be directly given to one of many machine learning model “engines.”
Running the above code is essentially like writing down the
instructions for a recipe on a sheet of paper. We’ve prepared the recipe
to give to the workflow, but we are probably interested in seeing what
the results of the recipe itself actually look like. Did the dummy
coding work, for example? To apply the recipe to a data set and view the
results, we can use prep(), which is akin to setting out a
mise en place of ingredients, followed by bake(), which carries
out the preprocessing steps.
We’ll also use kable() and kable_styling()
from the kableExtra
package, which we loaded above. It’s not necessary to use these
functions, but doing so allows the table of data to display more neatly,
so that all the columns and rows are actually legible. We also use
head() to select only the first few rows; otherwise the
entire data frame would print, which would be time-consuming. Also note
the use of scroll_box(), allowing us to scroll through the
entire data set.
prep(diamonds_recipe) %>%
bake(new_data = diamonds_train) %>%
head() %>%
kable() %>%
kable_styling(full_width = F) %>%
scroll_box(width = "100%", height = "200px")
| carat | depth | table | x | y | z | price | cut_1 | cut_2 | cut_3 | cut_4 | color_1 | color_2 | color_3 | color_4 | color_5 | color_6 | clarity_1 | clarity_2 | clarity_3 | clarity_4 | clarity_5 | clarity_6 | clarity_7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.23 | 61.5 | 55 | 3.95 | 3.98 | 2.43 | 326 | 0.6324555 | 0.5345225 | 0.3162278 | 0.1195229 | -0.3779645 | 0.0000000 | 0.4082483 | -0.5640761 | 0.4364358 | -0.1973855 | -0.3857584 | 0.0771517 | 0.3077287 | -0.5237849 | 0.4921546 | -0.3077287 | 0.1194880 |
| 0.21 | 59.8 | 61 | 3.89 | 3.84 | 2.31 | 326 | 0.3162278 | -0.2672612 | -0.6324555 | -0.4780914 | -0.3779645 | 0.0000000 | 0.4082483 | -0.5640761 | 0.4364358 | -0.1973855 | -0.2314550 | -0.2314550 | 0.4308202 | -0.1208734 | -0.3637664 | 0.5539117 | -0.3584641 |
| 0.23 | 56.9 | 65 | 4.05 | 4.07 | 2.31 | 327 | -0.3162278 | -0.2672612 | 0.6324555 | -0.4780914 | -0.3779645 | 0.0000000 | 0.4082483 | -0.5640761 | 0.4364358 | -0.1973855 | 0.0771517 | -0.3857584 | -0.1846372 | 0.3626203 | 0.3209704 | -0.3077287 | -0.5974401 |
| 0.29 | 62.4 | 58 | 4.20 | 4.23 | 2.63 | 334 | 0.3162278 | -0.2672612 | -0.6324555 | -0.4780914 | 0.3779645 | 0.0000000 | -0.4082483 | -0.5640761 | -0.4364358 | -0.1973855 | -0.0771517 | -0.3857584 | 0.1846372 | 0.3626203 | -0.3209704 | -0.3077287 | 0.5974401 |
| 0.31 | 63.3 | 58 | 4.34 | 4.35 | 2.75 | 335 | -0.3162278 | -0.2672612 | 0.6324555 | -0.4780914 | 0.5669467 | 0.5455447 | 0.4082483 | 0.2417469 | 0.1091089 | 0.0328976 | -0.3857584 | 0.0771517 | 0.3077287 | -0.5237849 | 0.4921546 | -0.3077287 | 0.1194880 |
| 0.24 | 62.8 | 57 | 3.94 | 3.96 | 2.48 | 336 | 0.0000000 | -0.5345225 | 0.0000000 | 0.7171372 | 0.5669467 | 0.5455447 | 0.4082483 | 0.2417469 | 0.1091089 | 0.0328976 | 0.2314550 | -0.2314550 | -0.4308202 | -0.1208734 | 0.3637664 | 0.5539117 | 0.3584641 |
The recipes package includes many other step functions. Name three step functions that
weren’t used here and describe what they do.

Next, we can specify the model engine that we want to fit:
lm_model <- linear_reg() %>%
set_engine("lm")
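If you’re curious how parsnip will translate this specification into a call to the underlying package, translate() shows the template:

```r
lm_model %>% translate()
# The printed output shows that stats::lm() will ultimately be called,
# with the formula and data filled in at fit time.
```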
We set up a workflow. This step might seem unnecessary now, with only one model and one recipe, but it can make life easier when you are trying out a series of models or several different recipes later on.
lm_wflow <- workflow() %>%
add_model(lm_model) %>%
add_recipe(diamonds_recipe)
Finally, we can fit the linear model to the training set:
lm_fit <- fit(lm_wflow, diamonds_train)
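The fitted workflow wraps an ordinary lm object. If you want the familiar summary() output from base R, you can extract the engine fit directly:

```r
lm_fit %>%
  extract_fit_engine() %>%  # pulls out the underlying lm object
  summary()
```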
We can view the model results:
lm_fit %>%
# This returns the parsnip object:
extract_fit_parsnip() %>%
# Now tidy the linear model object:
tidy()
## # A tibble: 24 Ă— 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5317. 496. 10.7 8.54e- 27
## 2 carat 11282. 54.3 208. 0
## 3 depth -58.4 6.25 -9.33 1.07e- 20
## 4 table -24.1 3.23 -7.45 9.40e- 14
## 5 x -970. 49.0 -19.8 5.16e- 87
## 6 y 9.08 19.9 0.456 6.49e- 1
## 7 z -127. 73.2 -1.74 8.23e- 2
## 8 cut_1 596. 25.0 23.8 1.22e-124
## 9 cut_2 -300. 19.9 -15.0 4.77e- 51
## 10 cut_3 153. 17.2 8.90 5.99e- 19
## # … with 14 more rows
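tidy() reports one row per coefficient; for model-level statistics (such as \(R^2\) and the residual standard error), glance() works the same way:

```r
lm_fit %>%
  extract_fit_parsnip() %>%
  # One row of model-level summary statistics:
  glance()
```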
Explain what the intercept represents.
Describe the effect of carat. Is it a significant
predictor of price? Holding everything else constant, what
is the effect on price of a one-unit increase in
carat?
The following code generates predicted values for price
for each observation in the training set:
diamond_train_res <- predict(lm_fit, new_data = diamonds_train %>% select(-price))
diamond_train_res %>%
head()
## # A tibble: 6 Ă— 1
## .pred
## <dbl>
## 1 -1363.
## 2 -667.
## 3 199.
## 4 -815.
## 5 -3472.
## 6 -1373.
Now we attach a column with the actual observed price
observations:
diamond_train_res <- bind_cols(diamond_train_res, diamonds_train %>% select(price))
diamond_train_res %>%
head()
## # A tibble: 6 Ă— 2
## .pred price
## <dbl> <int>
## 1 -1363. 326
## 2 -667. 326
## 3 199. 327
## 4 -815. 334
## 5 -3472. 335
## 6 -1373. 336
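The predict-then-bind_cols pattern is common enough that tidymodels provides a shortcut: augment() returns the data with a .pred column attached. A sketch of the equivalent call:

```r
# Equivalent to predict() followed by bind_cols():
augment(lm_fit, new_data = diamonds_train) %>%
  select(price, .pred) %>%
  head()
```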
We might be interested in a plot of predicted values vs. actual values, here for the training data:
diamond_train_res %>%
ggplot(aes(x = .pred, y = price)) +
geom_point(alpha = 0.2) +
geom_abline(lty = 2) +
theme_bw() +
coord_obs_pred()
It’s fairly clear that the model didn’t do very well. If it predicted every observation accurately, the dots would fall along the dashed diagonal line. We also predicted some negative values for price, and once the actual price exceeds approximately \(\$5,000\), the model does a pretty poor job.
The odds are that a linear model is simply not the best tool for this
machine learning task. It is likely not an accurate representation of
f(); remember that by using a linear regression, we are
imposing a specific form on the function, rather than learning the
function from the data.
In future labs, we’ll try out different models and compare them. Finally, we can calculate the training root mean squared error (RMSE) and the testing RMSE.
diamond_test_res <- predict(lm_fit, new_data = diamonds_test %>% select(-price))
diamond_test_res <- bind_cols(diamond_test_res, diamonds_test %>% select(price))
rmse(diamond_train_res, truth = price, estimate = .pred)
## # A tibble: 1 Ă— 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 1126.
rmse(diamond_test_res, truth = price, estimate = .pred)
## # A tibble: 1 Ă— 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 1146.
We can create and view a “metric set” of RMSE, \(R^2\), and MAE as shown:
diamond_metrics <- metric_set(rmse, rsq, mae)
diamond_metrics(diamond_train_res, truth = price,
estimate = .pred)
## # A tibble: 3 Ă— 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 1126.
## 2 rsq standard 0.920
## 3 mae standard 736.
diamond_metrics(diamond_test_res, truth = price,
estimate = .pred)
## # A tibble: 3 Ă— 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 1146.
## 2 rsq standard 0.918
## 3 mae standard 751.
Is there a difference between the three metrics for the training data and the testing data?
Do your best to explain why this difference does (or does not) exist.
Now we’ll take the recipe we’ve already created and try fitting a KNN
(k-nearest
neighbors) model with it! To do this, we’ll use the
nearest_neighbor() function, rather than the
linear_reg() function, but we’ll still need to select an
engine. There is only one R package, or engine, that works with
nearest_neighbor(), and that is the kknn
package.
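As a preview of what that specification might look like (the value of neighbors below is an illustrative placeholder, not a tuned choice):

```r
# A sketch of a KNN specification and workflow, reusing the same recipe.
# neighbors = 5 is a placeholder value for illustration only.
knn_model <- nearest_neighbor(neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("regression")

knn_wflow <- workflow() %>%
  add_model(knn_model) %>%
  add_recipe(diamonds_recipe)
```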